The Elephant in the Room -- Why AI Safety Demands Diverse Teams

Rostcheck, David, Scheibling, Lara

arXiv.org Artificial Intelligence

We consider that existing approaches to AI "safety" and "alignment" may not be using the most effective tools, teams, or approaches. We suggest that a better alternative is to treat alignment as a social science problem, since the social sciences offer a rich toolkit of models for understanding and aligning motivation and behavior, much of which could be repurposed for problems involving AI models, and we enumerate reasons why this is so. We introduce an alternative alignment approach informed by social science tools and characterized by three steps: 1. defining a positive desired social outcome for human/AI collaboration as the goal or "North Star"; 2. properly framing knowns and unknowns; and 3. forming diverse teams to investigate, observe, and navigate emerging challenges in alignment.


Learning Metrics that Maximise Power for Accelerated A/B-Tests

Jeunen, Olivier, Ustimenko, Aleksei

arXiv.org Artificial Intelligence

Online controlled experiments are a crucial tool to allow for confident decision-making in technology companies. A North Star metric is defined (such as long-term revenue or user retention), and system variants that statistically significantly improve on this metric in an A/B-test can be considered superior. North Star metrics are typically delayed and insensitive. As a result, the cost of experimentation is high: experiments need to run for a long time, and even then, type-II errors (i.e. false negatives) are prevalent. We propose to tackle this by learning metrics from short-term signals that directly maximise the statistical power they harness with respect to the North Star. We show that existing approaches are prone to overfitting, in that higher average metric sensitivity does not imply improved type-II errors, and propose to instead minimise the $p$-values a metric would have produced on a log of past experiments. We collect such datasets from two social media applications with over 160 million Monthly Active Users each, totalling over 153 A/B-pairs. Empirical results show that we are able to increase statistical power by up to 78% when using our learnt metrics stand-alone, and by up to 210% when used in tandem with the North Star. Alternatively, we can obtain constant statistical power at a sample size that is down to 12% of what the North Star requires, significantly reducing the cost of experimentation.
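As a loose illustration of the idea (not the paper's actual method), the sketch below learns a weighted combination of short-term signals by searching for weights that maximise the average two-sample z-statistic across a simulated log of past experiments. All data, the number of signals, and the random-search procedure are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log of past A/B experiments: per-user short-term signals
# observed in the treatment and control arms of each experiment.
n_experiments, n_users, n_signals = 20, 500, 3
true_effect = rng.normal(0.05, 0.02, size=(n_experiments, 1, n_signals))
control = rng.normal(0, 1, size=(n_experiments, n_users, n_signals))
treatment = rng.normal(0, 1, size=(n_experiments, n_users, n_signals)) + true_effect

def z_scores(w):
    """Two-sample z-statistic of the combined metric (signals @ w), per experiment."""
    t, c = treatment @ w, control @ w
    diff = t.mean(axis=1) - c.mean(axis=1)
    se = np.sqrt(t.var(axis=1) / n_users + c.var(axis=1) / n_users)
    return diff / se

# Random search over weights on the simplex, maximising mean |z| across the
# experiment log -- a crude stand-in for minimising past p-values.
best_w, best_score = None, -np.inf
for _ in range(200):
    w = rng.dirichlet(np.ones(n_signals))
    score = np.abs(z_scores(w)).mean()
    if score > best_score:
        best_w, best_score = w, score
```

In practice the paper fits metrics against real experiment logs rather than simulated ones; the sketch only shows the shape of the objective (metric weights scored by the statistical power they would have had historically).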


My North Star for the Future of AI

The Atlantic - Technology

Whatever academics like me thought artificial intelligence was, or what it might become, one thing is now undeniable: It is no longer ours to control. For me, a computer science professor at Stanford, it had been a private obsession--a layer of thoughts that superimposed itself quietly over my view of the world. By the mid-2010s, however, the cultural preoccupation with AI had become deafeningly public. Billboards along Highway 101 on the California coast heralded the hiring sprees of AI start-ups. I'd hear fragments of conversation about AI on my car radio as I changed stations. The little red couch in my office, where so many of the projects that had defined our lab's reputation had been conceived, was becoming the place where I'd regularly plead with younger researchers to keep some room in their studies for the foundational texts upon which our science was built. I'd noticed, first to my annoyance and then to my concern, how consistently those texts were being neglected as the ever-accelerating advances of the moment drew everyone's attention to more topical sources of information.


Choosing a Proxy Metric from Past Experiments

Tripuraneni, Nilesh, Richardson, Lee, D'Amour, Alexander, Soriano, Jacopo, Yadlowsky, Steve

arXiv.org Machine Learning

In many randomized experiments, the treatment effect of the long-term metric (i.e. the primary outcome of interest) is often difficult or infeasible to measure. Such long-term metrics are often slow to react to changes and sufficiently noisy that they are challenging to estimate faithfully in short-horizon experiments. A common alternative is to measure several short-term proxy metrics in the hope they closely track the long-term metric -- so they can be used to effectively guide decision-making in the near-term. We introduce a new statistical framework to both define and construct an optimal proxy metric for use in a homogeneous population of randomized experiments. Our procedure first reduces the construction of an optimal proxy metric in a given experiment to a portfolio optimization problem which depends on the true latent treatment effects and noise level of the experiment under consideration. We then denoise the observed treatment effects of the long-term metric and a set of proxies in a historical corpus of randomized experiments to extract estimates of the latent treatment effects for use in the optimization problem. One key insight derived from our approach is that the optimal proxy metric for a given experiment is not a priori fixed; rather, it should depend on the sample size (or effective noise level) of the randomized experiment for which it is deployed. To instantiate and evaluate our framework, we employ our methodology in a large corpus of randomized experiments from an industrial recommendation system and construct proxy metrics that perform favorably relative to several baselines.
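The key insight above -- that the best proxy weights depend on the experiment's noise level -- can be shown with a toy sketch (not the paper's actual estimator): regress a simulated long-term latent effect on noisy proxy observations at two noise levels and compare the fitted weights. All quantities here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical historical corpus: latent treatment effects on the long-term
# metric, and on two candidate proxies that track it imperfectly.
n_hist = 200
latent_long = rng.normal(0, 1, n_hist)
latent_prox = np.column_stack([
    0.8 * latent_long + rng.normal(0, 0.3, n_hist),  # proxy 1: strong but noisy link
    0.5 * latent_long + rng.normal(0, 0.1, n_hist),  # proxy 2: weaker but cleaner link
])

def optimal_weights(obs_noise):
    """Least-squares weights for predicting the long-term effect from proxy
    observations corrupted by experiment-level noise of scale obs_noise."""
    noisy = latent_prox + rng.normal(0, obs_noise, latent_prox.shape)
    w, *_ = np.linalg.lstsq(noisy, latent_long, rcond=None)
    return w

w_small = optimal_weights(0.05)  # large experiment: little observation noise
w_large = optimal_weights(1.0)   # small experiment: heavy observation noise
```

With heavy observation noise the fitted weights shrink (attenuation), so the "optimal" proxy combination genuinely differs by experiment size, matching the abstract's claim.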


Pareto optimal proxy metrics

Richardson, Lee, Zito, Alessandro, Greaves, Dylan, Soriano, Jacopo

arXiv.org Artificial Intelligence

North star metrics and online experimentation play a central role in how technology companies improve their products. In many practical settings, however, evaluating experiments based on the north star metric directly can be difficult. The two most significant issues are 1) low sensitivity of the north star metric and 2) differences between the short-term and long-term impact on the north star metric. A common solution is to rely on proxy metrics rather than the north star in experiment evaluation and launch decisions. Existing literature on proxy metrics concentrates mainly on the estimation of the long-term impact from short-term experimental data. In this paper, instead, we focus on the trade-off between the estimation of the long-term impact and the sensitivity in the short term. In particular, we propose the Pareto optimal proxy metrics method, which simultaneously optimizes prediction accuracy and sensitivity. In addition, we give an efficient multi-objective optimization algorithm that outperforms standard methods. We applied our methodology to experiments from a large industrial recommendation system, and found proxy metrics that are eight times more sensitive than the north star and consistently moved in the same direction, increasing the velocity and the quality of the decisions to launch new features.
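A minimal sketch of the multi-objective view (the paper's algorithm is more sophisticated): given hypothetical candidate metrics each scored on prediction accuracy and sensitivity, keep only the Pareto-optimal ones -- those not dominated on both objectives. The scores below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical candidate proxy metrics, each scored on two objectives:
# long-term prediction accuracy and short-term sensitivity (higher is better).
scores = rng.uniform(0, 1, size=(50, 2))

def pareto_front(points):
    """Indices of points that no other point dominates on both objectives."""
    keep = []
    for i, p in enumerate(points):
        dominated = np.any(
            np.all(points >= p, axis=1) & np.any(points > p, axis=1)
        )
        if not dominated:
            keep.append(i)
    return keep

front = pareto_front(scores)
```

A launch decision would then pick from `front` according to how the team trades off accuracy against sensitivity, rather than optimising either objective alone.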


Alvarez & Marsal Launches AI powered Digital Agency A&MPLIFY

#artificialintelligence

Leading global professional services firm Alvarez & Marsal (A&M) launched A&MPLIFY by Alvarez & Marsal, a digital agency charged with helping corporate and private equity clients uncover new revenue streams and develop new customer experiences through digital disruption. This mission is powered by a set of scalable digital tools and a deep bench of experienced digital thinkers, combined with A&M's extensive industry experience and heritage of execution excellence. "A&MPLIFY's north star is breakthrough growth for clients. Our approach builds on A&M's operational backbone and results-driven mindset. Our offerings dovetail with clients seeking to drive their businesses forward through meaningful impacts that make their customers' lives better." A&MPLIFY is led by A&M Managing Director, Bob Ghafouri.


How AI And Human Intelligence Will Beat Cancer - AI Summary

#artificialintelligence

For context, Go is a board game previously thought to require too much human intuition for a computer to succeed in, and as a result, it was a North Star for AI. Centuries ago, scientists and doctors operated largely in the dark when attempting to cure diseases and had to rely solely on their intuition. Many current and past approaches in the field relied on a single researcher or academic group's intuition for prioritizing which genes to test edit. Recently, with advances in high-throughput single-cell CRISPR sequencing methods, we are nearing the possibility of simply testing all genes simultaneously on equal footing and in various experimental scenarios. In fact, we predict that in the next 10 years, we will have an equivalent of a Move 37 against cancer: a therapy that at first may seem counterintuitive (and at which human intuition alone would not arrive) but that in the end, shocks us all and wins the game for patients.


AI experts establish the "North Star" for the domestic robotics field

AITopics Custom Links

Robots that do everything from helping people get dressed in the morning to washing (and putting away) the dishes have been a dream for as long as people have uttered the words "artificial intelligence." But, in a field where the state of the art currently rests far short of that level of sophistication, a fundamental challenge has emerged: Namely, what will "success" even look like, should the day come when robots are able to perform these key tasks to human standards? To do these mundane but surprisingly complex tasks, a robot must be able to perceive, reason, and operate with full awareness not only of its own physical dimensions and capabilities, but also of the world and objects around it. In robotics, this combination of situational and physical awareness and capability is known as embodied AI. Now, a multidisciplinary team of researchers at Stanford University has released the Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments (BEHAVIOR).


Can computers think? -- The north star in the quest for general intelligence

#artificialintelligence

Augusta Ada King, Countess of Lovelace, widely regarded as the world's first computer programmer, said of the Analytical Engine: "The Analytical Engine has no pretensions whatever to originate anything" [1]. Hence, it is safe to say that the question "Can computers think?", in some form, not only predates the concept of Artificial Intelligence (AI) but is almost as old as the Analytical Engine. This question has stimulated the minds of pioneers and researchers from different domains including computer science, mathematics, psychology, and philosophy. This essay delves into some of the important facets of this question. It is primarily driven by the thoughts and arguments of Alan M. Turing and John R. Searle, two pioneers who have extensively explored this question.


A Ranking Approach to Fair Classification

Schoeffer, Jakob, Kuehl, Niklas, Valera, Isabel

arXiv.org Artificial Intelligence

Algorithmic decision systems are increasingly used in areas such as hiring, school admission, or loan approval. Typically, these systems rely on labeled data for training a classification model. However, in many scenarios, ground-truth labels are unavailable, and instead we only have access to imperfect labels as the result of (potentially biased) human-made decisions. Despite being imperfect, historical decisions often contain some useful information on the unobserved true labels. In this paper, we focus on scenarios where only imperfect labels are available and propose a new fair ranking-based decision system, as an alternative to traditional classification algorithms. Our approach is both intuitive and easy to implement, and thus particularly suitable for adoption in real-world settings. In more detail, we introduce a distance-based decision criterion, which incorporates useful information from historical decisions and accounts for unwanted correlation between protected and legitimate features. Through extensive experiments on synthetic and real-world data, we show that our method is fair, as it a) assigns the desirable outcome to the most qualified individuals, and b) removes the effect of stereotypes in decision-making, thereby outperforming traditional classification algorithms. Additionally, we are able to show theoretically that our method is consistent with a prominent concept of individual fairness which states that "similar individuals should be treated similarly."
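To illustrate the flavour of a distance-based, decorrelated ranking rule (a simplified stand-in, not the authors' exact criterion): residualise a legitimate feature on the protected attribute to remove the unwanted correlation, score candidates by closeness to the centre of historically accepted individuals, and admit the top-ranked k. All data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical applicant pool: one legitimate feature correlated with a
# binary protected attribute, plus imperfect historical accept/reject labels.
n = 300
protected = rng.integers(0, 2, n)
legit = rng.normal(0, 1, n) + 0.5 * protected        # unwanted correlation
hist_label = (legit + rng.normal(0, 0.5, n)) > 0.5   # potentially biased decisions

# 1. Remove the correlation with the protected attribute by residualising.
X = np.column_stack([np.ones(n), protected])
beta, *_ = np.linalg.lstsq(X, legit, rcond=None)
legit_fair = legit - X @ beta

# 2. Distance-based score: closeness to the mean of historically accepted
#    individuals, measured on the decorrelated feature.
center = legit_fair[hist_label].mean()
score = -np.abs(legit_fair - center)

# 3. Rank by score and assign the desirable outcome to the top k candidates.
k = 50
accepted = np.argsort(score)[::-1][:k]
```

After residualisation the ranking feature is uncorrelated with the protected attribute by construction, which is one concrete way to operationalise "accounts for unwanted correlation between protected and legitimate features."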